Skip to content

Add columnar data access for memory-efficient row processing#975

Merged
jayantsing-db merged 9 commits into
databricks:mainfrom
jayantsing-db:jayantsing-db/fetch-requests-columnar-view
Sep 11, 2025
Merged

Add columnar data access for memory-efficient row processing#975
jayantsing-db merged 9 commits into
databricks:mainfrom
jayantsing-db:jayantsing-db/fetch-requests-columnar-view

Conversation

@jayantsing-db
Copy link
Copy Markdown
Collaborator

Description

This PR contains changes from the PR #966 as well.

Introduce ColumnarRowView to provide direct access to columnar data without
materialising entire result sets into row objects. This reduces memory
allocations by allowing individual cell access via getValue(row, col)
instead of creating List<List<Object>> structures.

Key changes:

  • New ColumnarRowView class with direct columnar access methods.
  • Updated LazyThriftResult to use columnar view instead of materialized rows.
  • Added utility method in DatabricksThriftUtil for creating columnar views.
  • Comprehensive test coverage for all column types and null handling.

This optimization maintains API compatibility while significantly reducing
memory overhead for large result sets.

Following the changes introduced in PR #966, the following improvements were
observed during a test that executes a SQL query retrieving 5 million rows:

Current heap usage over time:
image

Improved heap usage over time:
image

  • The first image shows multiple significant CPU usage spikes reaching around 15%, while the second image shows consistently flat CPU usage at approximately 3%.
  • The erratic CPU behavior in the "before" state has been completely smoothed out.
  • CPU usage is now consistent and predictable rather than volatile.
  • Memory usage dropped from a peak of 8440 MB down to just 745 MB - that's roughly an 91% decrease.
  • The large memory spike and sustained high usage in the first image has been completely resolved.
  • Memory usage is now consistently low and stable throughout the monitoring period.
  • There is no negative effect on execution time; in fact, it appears to improve, likely due to more active GC pauses in the previous state.

Testing

  • Unit tests
  • e2e tests
  • FakeService tests

Additional Notes to the Reviewer

}

/** Interface for accessing column values by index without materializing the entire column. */
private interface ColumnAccessor {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use separate files for interface and impl

if (column.isSetStringVal()) return column.getStringVal().getValuesSize();

throw new DatabricksSQLException(
"Unsupported column type: " + column, DatabricksDriverErrorCode.UNSUPPORTED_OPERATION);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about complex datatypes? Will they also be covered in above primitive types?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only support these

private static List<?> getColumnValues(TColumn column) throws DatabricksSQLException {
columns. There is nothing new added or removed in these changes. If complex types come as binary (which i think is the case), complex types are supported. Otherwise, not and this is the current behaviour too.

* out of bounds
*/
@Override
public Object getObject(int columnIndex) throws DatabricksSQLException {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this work out of box? You return primitive types from ColumnAccessor, and here we can have complex types as well. Will the conversion happen implicitly?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a binary type as well. Added more details #975 (comment) in this comment.

@jayantsing-db jayantsing-db enabled auto-merge (squash) September 10, 2025 12:42
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a memory-efficient columnar data access mechanism for JDBC result processing. Instead of materializing entire result sets into List<List<Object>> structures, it provides direct access to columnar data through a new ColumnarRowView class, resulting in significant memory reduction (up to 91% in testing) and improved CPU performance.

  • Introduces ColumnarRowView class for memory-efficient row-by-row data access
  • Updates LazyThriftResult to use columnar views instead of materialized row lists
  • Adds utility method in DatabricksThriftUtil for creating columnar views

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
ColumnarRowView.java New class providing direct columnar access with getValue(row, col) method and null handling
LazyThriftResult.java Refactored to use ColumnarRowView instead of List<List<Object>> for batch processing
DatabricksThriftUtil.java Added createColumnarView() utility method as memory-efficient alternative
ColumnarRowViewTest.java Comprehensive test coverage for all column types, null handling, and boundary conditions

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +144 to +154
private final BitSet nullBits;

public TypedColumnAccessor(List<T> values, byte[] nulls) {
this.values = values;
this.nullBits = nulls != null ? BitSet.valueOf(nulls) : null;
}

@Override
public Object getValue(int rowIndex) {
if (nullBits != null && nullBits.get(rowIndex)) {
return null;
Copy link

Copilot AI Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a new BitSet for every column accessor could be memory-intensive for large datasets. Consider lazy initialization or caching the BitSet creation to improve memory efficiency.

Suggested change
private final BitSet nullBits;
public TypedColumnAccessor(List<T> values, byte[] nulls) {
this.values = values;
this.nullBits = nulls != null ? BitSet.valueOf(nulls) : null;
}
@Override
public Object getValue(int rowIndex) {
if (nullBits != null && nullBits.get(rowIndex)) {
return null;
private final byte[] nulls;
private BitSet nullBits;
public TypedColumnAccessor(List<T> values, byte[] nulls) {
this.values = values;
this.nulls = nulls;
this.nullBits = null; // Lazy initialization
}
@Override
public Object getValue(int rowIndex) {
if (nulls != null) {
if (nullBits == null) {
nullBits = BitSet.valueOf(nulls);
}
if (nullBits.get(rowIndex)) {
return null;
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice suggestion. Thanks. Will implement in subsequent PR.

@jayantsing-db jayantsing-db merged commit 43ac32b into databricks:main Sep 11, 2025
12 of 13 checks passed
jayantsing-db added a commit that referenced this pull request Jan 19, 2026
## Description
<!-- Provide a brief summary of the changes made and the issue they aim
to address.-->
This PR introduces lazy loading support for inline Arrow results to
improve memory efficiency when handling large result sets.

Previously, InlineChunkProvider would eagerly fetch all arrow batches
upfront when results had hasMoreRows = true, which could lead to memory
issues with large datasets. This change splits the handling into two
separate paths:
1. Lazy path (new): For Thrift-based inline Arrow results (when
ARROW_BASED_SET is returned), we now use LazyThriftInlineArrowResult
which fetches arrow batches on-demand as the client iterates through
rows. This is similar to how LazyThriftResult works for columnar data.
2. Remote path (existing): For URL-based Arrow results (URL_BASED_SET),
we continue using ArrowStreamResult with RemoteChunkProvider which
downloads chunks from cloud storage.

The InlineChunkProvider is now only used for SEA results with JSON_ARRAY
format and INLINE disposition (contain all data inline {no hasMoreRows
flag set}).

This will reduce memory consumption and improve performance when dealing
with large inline Arrow result sets similar to
#975.

## Testing
<!-- Describe how the changes have been tested-->
- Unit tests
- Integration tests
- Manual testing

## Additional Notes to the Reviewer
<!-- Share any additional context or insights that may help the reviewer
understand the changes better. This could include challenges faced,
limitations, or compromises made during the development process.
Also, mention any areas of the code that you would like the reviewer to
focus on specifically. -->
Bypassing an existing failure on CI/CD because of 3e4f21c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants